Linear regression: Diagnostics and Exemplar 10.3

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Briefly: Assumptions for inference with a regression model

  1. Independent observations
  2. The observations’ residuals are centred at zero and have a similar measure of spread
  3. The observations’ residuals are approximately Normally distributed

Fitting a regression model to data

We often check the second and third assumptions after fitting any regression model to the data

For simple linear regression: The R function, lm(), is “saying” that the following equation, the best-fit line, is appropriate for the data

\[ \begin{aligned} y_i = &~\beta_0 + \beta_1 \times x_i + \varepsilon_{i}, \\ &~ \text{where} ~ \varepsilon_{i} \sim \text{Normal}(0, \sigma_\varepsilon) \end{aligned} \]

By fitting the model first, we see that it is the residuals, \(\varepsilon_{i}\), that we want to have a similar spread after accounting for the best-fit line, and that we want to be Normally distributed

Details on the equation for simple linear regression

\[ \begin{aligned} y_i = &~\beta_0 + \beta_1 \times x_i + \varepsilon_{i}, \\ &~ \text{where} ~ \varepsilon_{i} \sim \text{Normal}(0, \sigma_\varepsilon) \end{aligned} \]

where:

  • \(y_i\) is the value of the response variable for the \(i\)th observation
  • \(x_i\) is the value of the explanatory variable for the \(i\)th observation
  • \(\beta_0\) is the y-intercept of the best-fit line
  • \(\beta_1\) is the slope (“gradient”) of the best-fit line: the change in the average response when we increase \(x_i\) by one
  • \(\varepsilon_i\) is the residual value for the \(i\)th observation
  • \(\text{Normal}(0, \sigma_\varepsilon)\) means that we expect the residuals to be Normally distributed about 0 with a common standard deviation \(\sigma_\varepsilon\)

The best estimates of the \(\beta_j\)s are the \(\widehat{\beta}_j\)s obtained after fitting this equation to the data.

The above is achieved when the sum of squares for residuals, \(SSR\), is minimised, which, after a few more mathematical steps, gives us our best estimate of \(\sigma_\varepsilon\): \(\sqrt{MSR}\), the square root of the mean square for residuals
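As a quick numerical check of that last step (a sketch using R’s built-in cars data, since the crabs data from the earlier slides is not reproduced here), the “Residual standard error” reported by summary() is exactly \(\sqrt{SSR/(n - 2)}\):

```r
# Fit a simple linear regression to R's built-in cars data
fit <- lm(dist ~ speed, data = cars)

n   <- nrow(cars)
SSR <- sum(residuals(fit)^2)      # sum of squares for residuals
sigma_hat <- sqrt(SSR / (n - 2))  # square root of the mean square for residuals

# Should match the "Residual standard error" line in summary(fit)
all.equal(sigma_hat, summary(fit)$sigma)
```

The divisor is \(n - 2\) rather than \(n\) because two coefficients, \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), were estimated from the data.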

Diagnostic plot to assess 2.

This diagnostic plot is a scatter plot known as the Residuals versus Fitted plot

If the simple linear regression model is appropriate for the data:

  • The residuals (y-axis) are centred at 0
  • The residuals’ spread does not depend on the fitted value (x-axis)

In this case, an observation’s fitted value is equal to the average value of the response given its observed value of the explanatory variable

You could check whether “95%” of the residuals are randomly spread between \(0 \pm 2 \times \widehat{\sigma}_\varepsilon\)

summary(crabs.fit)$sigma
[1] 1.996048
# A diagnostic plot to help evaluate the second assumption
plot(crabs.fit, which = 1)
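A quick numeric version of the “95%” check above (a sketch, assuming the crabs.fit object fitted on the earlier slides; any fitted lm object works the same way):

```r
# Proportion of residuals falling within 0 +/- 2 * sigma-hat;
# for roughly Normal residuals this should be close to 0.95
sigma_hat <- summary(crabs.fit)$sigma
mean(abs(residuals(crabs.fit)) < 2 * sigma_hat)
```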

Worked Examples for 2.


Histogram of residuals to assess 3.

As per Slides 3–4, it is the residuals that need to be approximately Normally distributed

We should only check this assumption if the second assumption has been satisfied

If the simple linear regression model is appropriate for the data:

  • The residuals are unimodal—One “peak”
  • The shape of the residuals’ distribution is a “bell-shape”

So for CS 10.1 we would trust the inference made with the simple linear regression model

# A histogram to help evaluate the third assumption
MASS::truehist(crabs.fit$residuals, xlab = "Residuals",
               ylab = "Density")
curve(dnorm(x, sd = summary(crabs.fit)$sigma), 
      add = TRUE, lty = 2)

Diagnostic plot to assess 3.

This diagnostic plot is a scatter plot known as the Normal Q-Q plot

For any linear model that assumes the residuals are approximately Normally distributed: the observed residuals (y-axis) should agree with their theoretical Normal values (x-axis)

  • They need to be approximately exact for “small” datasets (\(n < 20\))
  • The tails of the Normal Q-Q plot can be somewhat distorted for “medium” datasets
    (\(20 \leq n \leq 50\))
  • The tails of the Normal Q-Q plot can be heavily distorted for “large” datasets
    (\(n > 50\))

# A diagnostic plot to help evaluate the third assumption
plot(crabs.fit, which = 2)
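To see how sample size changes the look of a Normal Q-Q plot even when the data really are Normal (a sketch; the simulated samples below are illustrative and not part of any case study):

```r
# Simulate truly Normal samples of increasing size and compare
# their Normal Q-Q plots side by side
set.seed(1)
par(mfrow = c(1, 3))
for (n in c(10, 35, 200)) {
  z <- rnorm(n)
  qqnorm(z, main = paste("n =", n))
  qqline(z, lty = 2)
}
```

Even these perfectly Normal samples show some wobble in the tails, which is why we judge the plots with the sample-size guidelines above in mind.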

Worked Examples for 3.


E 10.3: Ironman times

Context

A pair of researchers were interested in whether the world record times for an Ironman have changed over time. To investigate this question, they collected data on each occasion an athlete broke the world record from 1989 onwards. The data contain the following variables for the eight athletes who broke the world record.

Variables
Time A number denoting the world record time (in minutes)
Year A number denoting the year in which the world record was broken
# Read in the data
ironman.df <- read.csv("datasets/ironman.csv")

# Take a quick peek
head(ironman.df)
  Year     Time
1 1989 481.5333
2 1996 477.0333
3 1997 470.4500
4 2011 465.9667
5 2011 461.5500
6 2016 455.6500
# Number of observations
nrow(ironman.df)
[1] 8

Visualising the data

Describe any features of the data

There is a negative linear relationship between the world record times for an Ironman and when the world record was broken. It is difficult to comment on the strength of the relationship with only eight observations.

# xyplot() comes from the lattice package
library(lattice)
xyplot(Time ~ Year, data = ironman.df, xlab = "Year broken",
       ylab = "Time (minutes)",
       main = "World record times for an Ironman versus year broken")

Figure: The times of the 8 Ironman world records

Fitting a (simple) linear regression model

The equation of the fitted model?

\(\text{Time}_{i} = \beta_0 + \beta_1 \times \text{Year}_i + \varepsilon_{i}\), where \(\varepsilon_{i} \sim \text{Normal}(0, \sigma_\varepsilon)\)

# Fit the simple linear regression model to the data
ironman.fit <- lm(Time ~ Year, data = ironman.df)

# Produce a diagnostic plot to help evaluate the second assumption
plot(ironman.fit, which = 1)

# Produce a diagnostic plot to help evaluate the third assumption
plot(ironman.fit, which = 2)

Assumptions for inference

  1. Independent observations
    We have no details about how the data were collected. Also, is it believable that each observation is associated with a unique individual? Probably not in this case, as athletes who break world records tend to break them again after a short time.
  2. The observations’ residuals are centred at zero and have a similar measure of spread
    Typically, for small datasets, justifying that this assumption has been met is very difficult. After some thought, it should be clear that the residuals are not randomly scattered evenly about 0.
  3. The observations’ residuals are approximately Normally distributed
    We usually do not bother checking for approximately Normally distributed residuals if the previous assumption has not been met, as things can look Normal once we “collapse” the variable. The central component of the Normal Q-Q plot is a bit off, which suggests that the shape of the residuals’ distribution may not be exactly Normal, so we should investigate further with a histogram of the residuals.

Inference* for β1 and β0

β1
With 95% confidence, we estimate a one-year increase decreases the average Ironman world record time by somewhere between 0.74 and 1.42 minutes.

β0
With 95% confidence, we estimate that the average Ironman world record time at Year 0 (A.D.) is somewhere between 1955.83 and 3312.37 minutes.

Is the above interpretation sensible though?…
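One standard way to make the intercept interpretable is to shift the explanatory variable (a sketch; the choice of 1989, the first year in the data, as the reference year is ours):

```r
# Re-express Year relative to 1989 so the intercept becomes the
# estimated average record time in 1989, not in Year 0 A.D.
ironman.fit2 <- lm(Time ~ I(Year - 1989), data = ironman.df)
coef(ironman.fit2)
```

The slope, its standard error, and all fitted values are unchanged; only the intercept moves to a value of the explanatory variable that the data actually cover.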

# The summary output
summary(ironman.fit)

Call:
lm(formula = Time ~ Year, data = ironman.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6910 -2.0735  0.5486  1.9747  6.7607 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2634.1026   277.1940   9.503 7.74e-05 ***
Year          -1.0815     0.1381  -7.834 0.000229 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.464 on 6 degrees of freedom
Multiple R-squared:  0.9109,    Adjusted R-squared:  0.8961 
F-statistic: 61.36 on 1 and 6 DF,  p-value: 0.0002286
# The confidence intervals for the regression coefficients
confint(ironman.fit)
                 2.5 %     97.5 %
(Intercept) 1955.83343 3312.37181
Year          -1.41932   -0.74368

Inference* for the response

Multiple R2
Our fitted model explains 91.09% of the variability in the Ironman world record times

For when \(\text{Year}_i\) is 2021
We are 95% sure that the average Ironman world record time made in 2021 is somewhere between 442.5 and 454.3 minutes.

With 95% confidence, we predict that an Ironman world record time made in 2021 is somewhere between 436.0 and 460.8 minutes.

For when \(\text{Year}_i\) is 2023
We are extrapolating here, so any inference for the response assumes that the relationship between the response and explanatory variables remains linear

Year2021 <- data.frame(Year = 2021)
Year2023 <- data.frame(Year = 2023)

## A 95% CI for a mean response and 95% PI for an individual 
##  response given that the year is 2021
predict(ironman.fit, newdata = Year2021, interval = "c")
      fit      lwr      upr
1 448.391 442.4791 454.3028
predict(ironman.fit, newdata = Year2021, interval = "p")
      fit      lwr      upr
1 448.391 435.9706 460.8113
## A 95% CI for a mean response and 95% PI for an individual 
##  response given that the year is 2023
predict(ironman.fit, newdata = Year2023, interval = "c")
      fit      lwr      upr
1 446.228 439.7894 452.6665
predict(ironman.fit, newdata = Year2023, interval = "p")
      fit      lwr      upr
1 446.228 433.5484 458.9075